Training (add tensorboard debug, and mAP Calculation) #206
base: master
/home/tim/anaconda3/bin/python /home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py
/home/tim/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
/home/tim/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
2018-09-22 14:30:49.472148: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-22 14:30:49.562588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-22 14:30:49.562875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce MX150 major: 6 minor: 1 memoryClockRate(GHz): 1.5315
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.36GiB
2018-09-22 14:30:49.562886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1)
Create YOLOv3 model with 9 anchors and 2 classes.
Traceback (most recent call last):
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1 and 255 for 'Assign_360' (op: 'Assign') with input shapes: [1,1,1024,21], [255,1024,1,1].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 529, in <module>
yolo = Yolo()
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 73, in __init__
self.yolo_model = self.create_model(yolo_weights_path='model_data/yolo_weights.h5')
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 117, in create_model
model_body.load_weights(yolo_weights_path, skip_mismatch=True)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1161, in load_weights
f, self.layers, reshape=reshape)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 928, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2435, in batch_set_value
assign_op = x.assign(assign_placeholder)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 573, in assign
return state_ops.assign(self._variable, value, use_locking=use_locking)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 276, in assign
validate_shape=validate_shape)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 57, in assign
use_locking=use_locking, name=name)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2958, in create_op
set_shapes_for_outputs(ret)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2209, in set_shapes_for_outputs
shapes = shape_func(op)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2159, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 255 for 'Assign_360' (op: 'Assign') with input shapes: [1,1,1024,21], [255,1024,1,1].
Process finished with exit code 1
Can you help me see what is wrong with my code? Thanks!
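The two numbers in the shape error can be reproduced by hand. In YOLOv3 each of the three output scales uses 3 of the 9 anchors, and for every anchor the final 1x1 convolution predicts 4 box coordinates, 1 objectness score and one score per class. A small sketch of that arithmetic (the helper names are illustrative and do not come from voc_train_eval.py):

```python
def yolo_output_channels(anchors_per_scale: int, num_classes: int) -> int:
    """Channels of one YOLOv3 output conv: per anchor, 4 box coords + 1 objectness + class scores."""
    return anchors_per_scale * (4 + 1 + num_classes)


def yolo_head_kernel_shape(in_channels: int, anchors_per_scale: int, num_classes: int) -> tuple:
    """Keras kernel shape (height, width, in_channels, out_channels) of the final 1x1 conv."""
    return (1, 1, in_channels, yolo_output_channels(anchors_per_scale, num_classes))


anchors_per_scale = 9 // 3  # 9 anchors shared across 3 output scales
print(yolo_head_kernel_shape(1024, anchors_per_scale, 2))  # (1, 1, 1024, 21): the 2-class model being built
print(yolo_output_channels(anchors_per_scale, 80))  # 255: the 80-class COCO yolo_weights.h5
```

So the 21-channel kernel belongs to the new 2-class head and the 255-channel kernel to the COCO-pretrained weights, and those last-layer kernels cannot be copied across. In Keras 2.x, `Model.load_weights` only honors `skip_mismatch=True` when `by_name=True` is also passed; the traceback goes through `load_weights_from_hdf5_group`, which is the non-`by_name` path. Calling `model_body.load_weights(yolo_weights_path, by_name=True, skip_mismatch=True)`, as upstream keras-yolo3's `train.py` does, is therefore the likely fix.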
Hi @chenyuqing, I have tried train_v2.py and it looks fine.
Make sure your settings are the same as mine.
Hi, it seems that this repo has been inactive for a while (more than a year 😟).
Hi! I'm stuck because I can't get training to run. Which part of train_v2.py should I change to make it work?
It seems that the versions of Python, TensorFlow and Keras matter. The repository gives the following environment:
Python 3.5.2
I have also verified that it works with the following environment:
Python 3.6
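Since the answer hinges on matching versions, it helps to print exactly what is installed before comparing environments. A minimal sketch, assuming only that the packages expose the conventional `__version__` attribute (the function name is hypothetical):

```python
import importlib
import sys


def report_versions(packages=("tensorflow", "keras")):
    """Return the Python version and each package's __version__ (None if not installed)."""
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # package is not installed in this environment
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Running `print(report_versions())` in both the working and the failing environment makes the differences easy to spot.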
Thank you for the reply. I matched the versions, but it still doesn't work.
When I first trained it, it didn't matter whether the LOGS_PATH folder (the default is …) … I ran the following command:
To check the training results, I ran the following command:
In a web browser, go to http://localhost:6006/
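The exact commands were not preserved in this thread. As a sketch, assuming the standard TensorBoard CLI flags `--logdir` and `--port` (6006 is TensorBoard's default port, matching the URL above) and a hypothetical LOGS_PATH value, the invocation can be built like this:

```python
import shlex


def tensorboard_invocation(logs_path, port=6006):
    """Return the shell command that starts TensorBoard on logs_path and the URL to open."""
    cmd = ["tensorboard", "--logdir", logs_path, "--port", str(port)]
    url = "http://localhost:{}/".format(port)
    # shlex.quote protects paths that contain spaces or shell metacharacters
    return " ".join(shlex.quote(part) for part in cmd), url


command, url = tensorboard_invocation("logs/000")  # hypothetical LOGS_PATH value
print(command)  # tensorboard --logdir logs/000 --port 6006
print(url)  # http://localhost:6006/
```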
Sorry for the late reply. When I ran the above command as written, I got the following error:
Specifying "nano"
Specify "nano" if you do not specify a validation file or if it does not exist. As you can see around the following lines of code, train_v2.py splits the data into training and validation sets at a 9:1 ratio, regardless of whether a validation file is specified.
Line 439 in f4a9c40
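A 9:1 split like the one described can be sketched as a seeded shuffle followed by slicing. This follows the pattern of upstream keras-yolo3's `train.py` (shuffle the annotation lines with a fixed seed, hold out `int(len(lines) * 0.1)` for validation); the function name is hypothetical and this is not the code at line 439 itself:

```python
import random


def split_annotation_lines(lines, val_split=0.1, seed=10101):
    """Shuffle annotation lines reproducibly and split them into (train, val)."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # same seed -> same split on every run
    num_val = int(len(lines) * val_split)
    num_train = len(lines) - num_val
    return lines[:num_train], lines[num_train:]
```

With 100 annotation lines this yields 90 training and 10 validation lines; a separate validation file plays no part in this split.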
Why the error occurred
I think the error occurred because the following lines of code were not executed: the folder for temporary files was left undeleted, for example after an abnormal exit.
Line 82 in f4a9c40
Procedure before executing the command
Before executing the command, you must delete the temporary folder and move the results folder.
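The cleanup step above can be scripted so it is not forgotten after a crash. A minimal sketch, assuming the temporary folder is `tmp_pred_files` (the name mentioned later in this thread) and using hypothetical `results` and `old_results` folder names; adjust them to the paths train_v2.py actually uses:

```python
import shutil
import time
from pathlib import Path


def prepare_workspace(tmp_dir="tmp_pred_files", results_dir="results", archive_root="old_results"):
    """Delete leftover temporary files and move the previous results folder out of the way.

    Folder names are hypothetical defaults. Returns the archived results path, or None.
    """
    tmp = Path(tmp_dir)
    if tmp.exists():
        shutil.rmtree(tmp)  # stale files from an aborted run
    results = Path(results_dir)
    if not results.exists():
        return None
    archive = Path(archive_root)
    archive.mkdir(parents=True, exist_ok=True)
    # Timestamped name so earlier archives are never overwritten
    dest = archive / "{}_{}".format(results.name, time.strftime("%Y%m%d_%H%M%S"))
    shutil.move(str(results), str(dest))
    return dest
```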
@tfukumori … Also, how can I adjust the values on the horizontal axis of the training-loss graph?
"CUDA" error
I'm not sure about the CUDA error. Judging from the error message, the GPU may not be powerful enough, but I can't say for certain. If the GPU is the bottleneck, running on the CPU or reducing the batch size might solve the problem (at the cost of performance).
https://jp.mathworks.com/matlabcentral/answers/427234-what-is-the-cause-of-cuda_error_launch_failed
Adjusting the horizontal axis of the training-loss graph
If you mean changing the settings of the graph itself, I don't know. If you mean the number of epochs, the scale seems to depend on the number of images and the batch size.
Line 149 in f4a9c40
Line 199 in f4a9c40
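For the CUDA error discussed above, one way to try the CPU fallback without touching the training code is to hide all GPUs through the standard `CUDA_VISIBLE_DEVICES` environment variable. A minimal sketch (the function name is hypothetical); it only takes effect if it runs before TensorFlow or Keras is imported:

```python
import os
import sys
import warnings


def hide_gpus_from_tensorflow():
    """Make CUDA report no visible GPUs so TensorFlow falls back to the CPU."""
    if "tensorflow" in sys.modules:
        # The variable is read when CUDA initializes, so a late call may be ignored.
        warnings.warn("TensorFlow is already imported; hiding GPUs now may have no effect.")
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus_from_tensorflow()
# import keras  # import TensorFlow/Keras only after this point
```

Training on the CPU is much slower; reducing the batch size keeps the GPU but lowers its memory use.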
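If the horizontal axis of the loss graph counts training steps (batches) rather than epochs, its range follows directly from the number of training images and the batch size. A sketch assuming the `max(1, num_train // batch_size)` steps-per-epoch rule used by upstream keras-yolo3's `train.py`; whether train_v2.py logs per step or per epoch is not confirmed here:

```python
def steps_per_epoch(num_train: int, batch_size: int) -> int:
    # One step = one batch; at least one step even when there are fewer images than the batch size.
    return max(1, num_train // batch_size)


def total_training_steps(num_train: int, batch_size: int, epochs: int) -> int:
    return epochs * steps_per_epoch(num_train, batch_size)


print(steps_per_epoch(900, 32))  # 28
print(total_training_steps(900, 32, 50))  # 1400
```

Doubling the batch size roughly halves the number of steps per epoch, which is why the same number of epochs can produce a different axis range.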
@tfukumori Why is "tmp_pred_files" empty both before and after training? Also, when running yolo.py on full-HD input, is it better to change the following numbers?
Line 28 in e6598d1
Is this number the result of multiplying by 100?
I think this will be helpful. You can find it here: https://qiita.com/mdo4nt6n/items/08e11426e2fac8433fed
Provide useful debug information on TensorBoard:
mAP scalars
Images
Distributions
Histograms
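The mAP scalars listed above summarize detection quality per class. As an illustration of what such a scalar usually measures, here is a minimal sketch of PASCAL VOC-style average precision (all-point interpolation of the precision-recall curve) averaged over classes. It assumes detections are already sorted by descending score and matched to ground truth (for example at IoU >= 0.5); the function names are hypothetical, and this is not necessarily the calculation this PR implements:

```python
def precision_recall_from_matches(is_true_positive, num_ground_truth):
    """Cumulative recall/precision for detections sorted by descending confidence."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    return recalls, precisions


def voc_average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve (all-point interpolation)."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Precision envelope: make precision non-increasing from right to left
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall increases (zero-width steps add nothing)
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


def mean_average_precision(ap_per_class):
    """Mean of per-class AP values, e.g. {"cat": 0.8, "dog": 0.6}."""
    if not ap_per_class:
        return 0.0
    return sum(ap_per_class.values()) / len(ap_per_class)


recalls, precisions = precision_recall_from_matches([True, False, True], num_ground_truth=2)
print(voc_average_precision(recalls, precisions))  # about 0.833
```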